
Enable split mode graph for on-the-fly merged up/gate experts#1413

Merged

ikawrakow merged 3 commits into main from ik/sm_graph_muge on Mar 13, 2026

Conversation

@ikawrakow (Owner) commented Mar 12, 2026

While at it: this PR is a follow-up to #1412. It enables using on-the-fly merged ffn_up/gate_exps tensors (the -muge command line option) with split mode graph.

On a 2x3090 system, I see ~10% better PP for the few models I tested.

As a reminder: add -sm graph -muge to the command line to get the benefit of this PR.

Here is a sweep-bench for GPT-OSS-20B-MXFP4 on the 2x3090 system. The llama.cpp results are with build 8314.

[sweep-bench plot: gpt_oss_pp]

Nexesenex pushed a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Mar 12, 2026
…ow#1413

Split mode graph for on-the-fly merged ffn_up/gate_exps

Cleanup

Also handle merged bias

@ubergarm (Contributor)

This quick test shows -muge giving a +6.8% boost at short kv-cache depth and +2.8% near 128k depth with my Qwen3.5-122B-A10B-IQ4_KSS. It's been running well in some light testing today, including mmproj. 🚀

[sweep-bench plot: sweep-bench-Qwen3.5-122B-A10B-IQ4_KSS-PR1413]

Details:

title: "ik_llama.cpp PR1413 ik/sm_graph_muge@c046f7f3"
subtitle: "ubergarm/Qwen3.5-122B-A10B IQ4_KSS 61.219 GiB (4.306 BPW)"
hardware: "2x RTX A6000 (48GB VRAM each) Driver: 580.105.08 CUDA: 13.0 P2P: OK NCCL found!\n"

-sm graph

model=/ubergarm/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-IQ4_KSS.gguf
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -sm graph \
  -ngl 999 \
  -ub 4096 -b 4096 \
  --threads 1 \
  --no-mmap \
  -n 128 \
  --warmup-batch
PP    TG   N_KV    T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)
4096 128 0 1.348 3039.33 1.650 77.58
4096 128 4096 1.400 2925.21 1.660 77.10
4096 128 8192 1.450 2825.11 1.677 76.31
4096 128 12288 1.504 2723.43 1.710 74.86
4096 128 16384 1.570 2609.23 1.716 74.59
4096 128 20480 1.629 2514.64 1.725 74.21
4096 128 24576 1.688 2426.82 1.748 73.22
4096 128 28672 1.747 2345.06 1.754 72.96
4096 128 32768 1.797 2278.90 1.780 71.91
4096 128 36864 1.859 2203.43 1.786 71.69
4096 128 40960 1.914 2139.97 1.792 71.42
4096 128 45056 1.967 2082.38 1.817 70.43
4096 128 49152 2.023 2024.23 1.824 70.19
4096 128 53248 2.069 1979.35 1.833 69.82
4096 128 57344 2.113 1938.40 1.852 69.10
4096 128 61440 2.169 1888.08 1.859 68.87
4096 128 65536 2.218 1846.38 1.881 68.05
4096 128 69632 2.272 1802.70 1.889 67.75
4096 128 73728 2.323 1763.21 1.893 67.63
4096 128 77824 2.382 1719.76 1.921 66.64
4096 128 81920 2.433 1683.82 1.927 66.42
4096 128 86016 2.478 1652.73 1.943 65.89
4096 128 90112 2.530 1619.04 1.958 65.37
4096 128 94208 2.581 1586.89 1.966 65.12
4096 128 98304 2.637 1553.40 1.988 64.40
4096 128 102400 2.695 1520.12 1.994 64.19
4096 128 106496 2.741 1494.57 2.003 63.90
4096 128 110592 2.794 1466.23 2.024 63.26
4096 128 114688 2.847 1438.82 2.028 63.13
4096 128 118784 2.899 1413.02 2.054 62.32
4096 128 122880 2.954 1386.81 2.059 62.17
4096 128 126976 3.002 1364.55 2.063 62.03
4096 128 131072 3.056 1340.13 2.087 61.34

-sm graph -muge

model=/ubergarm/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-IQ4_KSS.gguf
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 135168 \
  -sm graph \
  -muge \
  -ngl 999 \
  -ub 4096 -b 4096 \
  --threads 1 \
  --no-mmap \
  -n 128 \
  --warmup-batch
PP    TG   N_KV    T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)
4096 128 0 1.262 3246.57 1.641 77.98
4096 128 4096 1.313 3120.22 1.646 77.79
4096 128 8192 1.355 3022.54 1.661 77.05
4096 128 12288 1.409 2906.77 1.692 75.66
4096 128 16384 1.470 2785.99 1.700 75.28
4096 128 20480 1.526 2684.02 1.706 75.03
4096 128 24576 1.586 2582.02 1.729 74.02
4096 128 28672 1.643 2493.31 1.738 73.63
4096 128 32768 1.699 2410.52 1.766 72.48
4096 128 36864 1.753 2336.64 1.769 72.36
4096 128 40960 1.807 2266.81 1.775 72.11
4096 128 45056 1.853 2210.21 1.798 71.21
4096 128 49152 1.913 2140.88 1.803 71.01
4096 128 53248 1.960 2089.72 1.814 70.57
4096 128 57344 2.013 2034.95 1.835 69.74
4096 128 61440 2.070 1979.21 1.844 69.43
4096 128 65536 2.115 1936.36 1.867 68.57
4096 128 69632 2.174 1884.38 1.874 68.31
4096 128 73728 2.228 1838.02 1.881 68.05
4096 128 77824 2.280 1796.17 1.906 67.14
4096 128 81920 2.331 1756.85 1.908 67.09
4096 128 86016 2.384 1718.11 1.921 66.63
4096 128 90112 2.435 1681.97 1.938 66.05
4096 128 94208 2.489 1645.42 1.944 65.86
4096 128 98304 2.549 1606.77 1.967 65.06
4096 128 102400 2.596 1577.51 1.976 64.77
4096 128 106496 2.656 1541.96 1.983 64.54
4096 128 110592 2.708 1512.47 2.008 63.75
4096 128 114688 2.760 1484.08 2.013 63.59
4096 128 118784 2.812 1456.69 2.038 62.81
4096 128 122880 2.874 1425.21 2.048 62.50
4096 128 126976 2.922 1401.85 2.051 62.42
4096 128 131072 2.972 1377.98 2.076 61.65
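As a sanity check on the headline percentages, they can be recomputed from the S_PP column at depth 0 and depth 131072 in the two tables above:

```python
# S_PP (t/s) at two kv-cache depths, taken from the two sweep-bench tables.
base   = {0: 3039.33, 131072: 1340.13}   # -sm graph
merged = {0: 3246.57, 131072: 1377.98}   # -sm graph -muge

for depth in (0, 131072):
    gain = (merged[depth] / base[depth] - 1) * 100
    print(f"depth {depth:>6}: +{gain:.1f}% PP")
# depth      0: +6.8% PP
# depth 131072: +2.8% PP
```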

I don't have a mainline-compatible quant handy to test on this rig, but I have seen that issue where sweep-bench fails towards the end on mainline too, as you mentioned in another PR.

@hksdpc255 (Contributor)

Are on-the-fly merged ffn_up/gate_exps GGUFs faster than just using --merge-up-gate-experts on non-merged GGUFs?

@ikawrakow (Owner, Author)

Are on-the-fly merged ffn_up/gate_exps GGUFs faster than just using --merge-up-gate-experts on non-merged GGUFs?

It is the same thing. I started using "on-the-fly merged" for --merge-up-gate-experts to distinguish it from the case where these tensors have already been merged in the model stored on disk (see issue #1399, which you opened yourself).

@hksdpc255 (Contributor)

I'm confused. Does that mean using pre-merged GGUFs is the same as using non-merged GGUFs with the -muge option?

@ikawrakow (Owner, Author)

I'm confused. Does that mean using pre-merged GGUFs is the same as using non-merged GGUFs with the -muge option?

Yes, it is the same. In the pre-merged case, someone (for instance AesSedai) has prepared the model such that the ffn_up_exps and ffn_gate_exps tensors are merged into a single ffn_gate_up_exps tensor and stored the model that way on disk. In that case, when we load the model we don't need to do anything to take advantage of the merge. With the on-the-fly merge (-muge) the model stored on disk contains separate ffn_up_exps and ffn_gate_exps tensors, and we merge them on-the-fly while loading the model. The end result (i.e., what happens during inference) is exactly the same.
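The equivalence is easy to see with a toy example (plain Python, tiny dimensions, not the project's actual CUDA/C++ code): stacking the up and gate weight rows into one fused matrix and doing a single matmul produces exactly the two outputs the separate matmuls would. The names mirror the GGUF tensor names; the gate-first ordering inside the fused tensor is an assumption made here for illustration.

```python
def matmul(w, x):
    # Multiply a weight matrix (list of rows) by an input vector.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

w_up   = [[1.0, 2.0], [3.0, 4.0]]    # ffn_up_exps (one expert, 2x2)
w_gate = [[5.0, 6.0], [7.0, 8.0]]    # ffn_gate_exps
x = [1.0, 0.5]

# Non-merged path: two separate matmuls.
up, gate = matmul(w_up, x), matmul(w_gate, x)

# Merged path (pre-merged on disk, or fused at load time by -muge):
# one ffn_gate_up_exps tensor, one larger matmul, outputs split in half.
w_gate_up = w_gate + w_up
fused = matmul(w_gate_up, x)
gate2, up2 = fused[:2], fused[2:]

assert gate2 == gate and up2 == up
print(gate, up)   # → [8.0, 11.0] [2.0, 5.0]
```

The win comes from launching one large kernel per layer instead of two smaller ones, which matters most for prompt processing, where the matmuls are batch-heavy.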

@hksdpc255 (Contributor)

CUDA_VISIBLE_DEVICES=6,7 numactl --cpunodebind=1 --membind=1 ./ik_llama.cpp-b4283/llama-sweep-bench -rtr -b 2048 -ub 2048 --split-mode graph --cache-type-k q8_0 --cache-type-v q8_0 --n-gpu-layers 999 --model Qwen3.5-35B-A3B-Q5_K_M-00001-of-00002.gguf --mmproj mmproj-Qwen3.5-35B-A3B-BF16.gguf --ctx-size 225280 --threads 8 --tensor-split 1,1 --numa isolate
PP    TG   N_KV    T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)
2048 512 0 0.465 4400.22 3.652 140.18
2048 512 2048 0.415 4929.12 3.700 138.36
2048 512 4096 0.413 4962.78 3.740 136.88
2048 512 6144 0.419 4892.39 3.785 135.25
2048 512 8192 0.425 4820.18 3.827 133.79
2048 512 10240 0.430 4765.75 3.893 131.52
2048 512 12288 0.438 4678.00 3.939 130.00
2048 512 14336 0.445 4607.09 3.992 128.27
2048 512 16384 0.450 4555.36 4.042 126.68
2048 512 18432 0.457 4485.72 4.096 124.99
2048 512 20480 0.464 4418.29 4.153 123.28
2048 512 22528 0.469 4365.78 4.237 120.85
2048 512 24576 0.476 4299.84 4.226 121.14
2048 512 26624 0.482 4251.36 4.263 120.09
2048 512 28672 0.488 4200.72 4.290 119.35
2048 512 30720 0.493 4151.75 4.323 118.43
2048 512 32768 0.505 4055.22 4.361 117.39
CUDA_VISIBLE_DEVICES=6,7 numactl --cpunodebind=1 --membind=1 ./ik_llama.cpp-b4283/llama-sweep-bench -rtr -b 2048 -ub 2048 --split-mode graph --cache-type-k q8_0 --cache-type-v q8_0 --n-gpu-layers 999 --model Qwen3.5-35B-A3B-Q5_K_M-00001-of-00002.gguf --mmproj mmproj-Qwen3.5-35B-A3B-BF16.gguf --ctx-size 225280 --threads 8 --tensor-split 1,1 --numa isolate -muge
PP    TG   N_KV    T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)
2048 512 0 0.440 4654.74 3.676 139.28
2048 512 2048 0.393 5216.47 3.725 137.45
2048 512 4096 0.388 5274.11 3.760 136.19
2048 512 6144 0.396 5171.61 3.805 134.57
2048 512 8192 0.403 5081.13 3.850 132.97
2048 512 10240 0.408 5015.72 3.901 131.26
2048 512 12288 0.414 4941.12 3.942 129.87
2048 512 14336 0.421 4862.68 4.003 127.91
2048 512 16384 0.428 4787.63 4.057 126.20
2048 512 18432 0.435 4708.26 4.112 124.52
2048 512 20480 0.440 4649.41 4.183 122.41
2048 512 22528 0.447 4582.38 4.271 119.87
2048 512 24576 0.453 4522.47 4.298 119.13
2048 512 26624 0.459 4460.42 4.309 118.81
2048 512 28672 0.465 4407.75 4.338 118.04
2048 512 30720 0.470 4361.50 4.371 117.13
2048 512 32768 0.476 4298.20 4.402 116.30
CUDA_VISIBLE_DEVICES=6,7 numactl --cpunodebind=1 --membind=1 ./ik_llama.cpp-b4283/llama-sweep-bench -rtr -b 2048 -ub 2048 --split-mode graph --cache-type-k q8_0 --cache-type-v q8_0 --n-gpu-layers 999 --model Qwen3.5-35B-A3B-Q5_K_M-merge-gate-up-00001-of-00002.gguf --mmproj mmproj-Qwen3.5-35B-A3B-BF16.gguf --ctx-size 225280 --threads 8 --tensor-split 1,1 --numa isolate
PP    TG   N_KV    T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)
2048 512 0 0.454 4507.90 3.697 138.47
2048 512 2048 0.399 5137.54 3.745 136.72
2048 512 4096 0.389 5261.75 3.782 135.37
2048 512 6144 0.395 5185.13 3.814 134.24
2048 512 8192 0.403 5086.56 3.862 132.57
2048 512 10240 0.408 5017.28 3.915 130.79
2048 512 12288 0.414 4941.87 3.952 129.55
2048 512 14336 0.422 4855.75 4.006 127.80
2048 512 16384 0.427 4795.48 4.058 126.16
2048 512 18432 0.433 4731.85 4.091 125.16
2048 512 20480 0.437 4686.27 4.154 123.26
2048 512 22528 0.444 4615.27 4.234 120.92
2048 512 24576 0.450 4549.59 4.267 120.00
2048 512 26624 0.456 4487.65 4.291 119.31
2048 512 28672 0.463 4420.85 4.315 118.65
2048 512 30720 0.469 4370.51 4.342 117.92
2048 512 32768 0.475 4307.13 4.379 116.91
CUDA_VISIBLE_DEVICES=6,7 numactl --cpunodebind=1 --membind=1 ./ik_llama.cpp-b4283/llama-sweep-bench -rtr -b 2048 -ub 2048 --split-mode graph --cache-type-k q8_0 --cache-type-v q8_0 --n-gpu-layers 999 --model Qwen3.5-35B-A3B-Q5_K_M-merge-gate-up-00001-of-00002.gguf --mmproj mmproj-Qwen3.5-35B-A3B-BF16.gguf --ctx-size 225280 --threads 8 --tensor-split 1,1 --numa isolate -muge
PP    TG   N_KV    T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)
2048 512 0 0.457 4481.63 3.707 138.11
2048 512 2048 0.399 5128.53 3.741 136.87
2048 512 4096 0.388 5284.00 3.775 135.64
2048 512 6144 0.394 5202.74 3.790 135.10
2048 512 8192 0.401 5112.52 3.829 133.73
2048 512 10240 0.406 5048.00 3.880 131.95
2048 512 12288 0.412 4971.28 3.926 130.43
2048 512 14336 0.419 4892.28 3.974 128.82
2048 512 16384 0.424 4831.77 4.030 127.05
2048 512 18432 0.431 4748.12 4.085 125.34
2048 512 20480 0.437 4686.52 4.151 123.33
2048 512 22528 0.444 4610.63 4.227 121.12
2048 512 24576 0.450 4547.48 4.280 119.63
2048 512 26624 0.457 4481.97 4.305 118.93
2048 512 28672 0.463 4420.30 4.338 118.01
2048 512 30720 0.468 4372.18 4.378 116.96
2048 512 32768 0.477 4295.49 4.409 116.14

@ikawrakow ikawrakow merged commit 7fab617 into main Mar 13, 2026
@abc-nix (Contributor) commented Mar 13, 2026

In my experience, running -sm graph -muge for hybrid GPU+CPU inference (with tensor offloading) of Qwen 3.5 397B IQ4_KSS from ubergarm causes the output to fall into loops. Flags used:

      -c 210000 \
      --jinja \
      -fa 1 -ngl 99 -ub 4096 -b 8192 \
      --ctx-checkpoints 12 --ctx-checkpoints-interval 16383 \
      -cuda fusion=1,offload-batch-size=4,mmq-id-size=128 \
      -gr -ger \
      --split-mode graph -smgs --graph-reduce-type f32 \
      -muge \
      -ts 35,26 \
      -ot "blk\.([0-2])\.ffn_(up|gate|down)_exps\.weight=CUDA0" \
      -ot "blk\.([5][7-9])\.ffn_(up|gate|down)_exps\.weight=CUDA1" \
      -ot "blk\.([0-9]|[1-9][0-9])\.ffn_(up|gate|down)_exps\.weight=CPU" \
      --no-warmup --no-mmap

If I don't use -muge I get proper outputs.

I suppose it is due to the "unexpected results if using custom tensor offloads with split-mode graph" warning. I am thankful that it still works very well without merging the up and gate expert tensors, so this is just a drawback of using custom tensor offloading.

@ubergarm (Contributor) commented Mar 13, 2026

@abc-nix

I'm not 100% sure, but digging through some recent PRs on imatrix fused up|gate tensors and the original -muge PR, it may be that when you use -muge your -ot patterns need to change from ffn_(up|gate|down)_exps to ffn_(gate_up|down)_exps ... might give that a try?

Same thing applies if you're using a mainline pre-merged quant...

ik was gracious and renamed the existing convention here to reduce confusion with the new, opposite naming convention on mainline...

so it is ffn_gate_up_exps everywhere now, pretty sure; gonna go submit a PR to add this to the --cpu-moe and --n-cpu-moe regex strings now
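The regex concern can be checked quickly with Python's re module (a standalone sanity check, not the project's actual -ot matching code — the exact override semantics in ik_llama.cpp are assumed here to be plain regex matching on tensor names):

```python
import re

# One layer's expert tensor names; ffn_gate_up_exps is the name used once
# up and gate are merged (pre-merged on disk, or on the fly via -muge).
names = [
    "blk.0.ffn_up_exps.weight",
    "blk.0.ffn_gate_exps.weight",
    "blk.0.ffn_down_exps.weight",
    "blk.0.ffn_gate_up_exps.weight",
]

separate = re.compile(r"blk\.0\.ffn_(up|gate|down)_exps\.weight")
catchall = re.compile(r"blk\.0\.ffn_(gate_up|up|gate|down)_exps\.weight")

# The separate-tensor pattern silently misses the merged tensor:
assert not separate.fullmatch("blk.0.ffn_gate_up_exps.weight")
# The catch-all pattern matches every variant, merged or not:
assert all(catchall.fullmatch(n) for n in names)
```

This is why a catch-all like abc-nix's ffn_(gate_up|up|gate|down)_exps is the safe choice: it covers both merged and non-merged models with one override string.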

@abc-nix (Contributor) commented Mar 13, 2026

@ubergarm, thanks for the tips. This will help when using models with these experts merged (I will be using a catch-all regex ffn_(gate_up|up|gate|down)_exps).

What I am seeing when offloading the up and gate tensors with a non-merged GGUF (the one in your repo) is that they are merged after they are loaded to the device. If I use -ot "blk\.([0-2])\.ffn_(gate_up|up|gate|down)_exps\.weight=CUDA0", I can see:

Tensor blk.0.ffn_up_exps.weight (size = 1026.00 MiB) buffer type overriden to CUDA0
Tensor blk.0.ffn_gate_exps.weight (size = 1026.00 MiB) buffer type overriden to CUDA0
merge_up_gate_exps: merging up/gate in layer 0
Tensor blk.0.ffn_down_exps.weight (size = 1096.00 MiB) buffer type overriden to CUDA0
Tensor blk.1.ffn_up_exps.weight (size = 1026.00 MiB) buffer type overriden to CUDA0
Tensor blk.1.ffn_gate_exps.weight (size = 1026.00 MiB) buffer type overriden to CUDA0
merge_up_gate_exps: merging up/gate in layer 1
Tensor blk.1.ffn_down_exps.weight (size = 1096.00 MiB) buffer type overriden to CUDA0
Tensor blk.2.ffn_up_exps.weight (size = 1026.00 MiB) buffer type overriden to CUDA0
Tensor blk.2.ffn_gate_exps.weight (size = 1026.00 MiB) buffer type overriden to CUDA0
merge_up_gate_exps: merging up/gate in layer 2
Tensor blk.2.ffn_down_exps.weight (size = 1096.00 MiB) buffer type overriden to CUDA0
[...]

So that is not the issue. The output still repeats itself, so using -muge with -sm graph and partial expert offloading is broken on my machine (unless there is a conflict with a different flag).
